Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

✨ Add DangerousWorkflow check for imposter commits. #2789

Closed
wants to merge 6 commits into from

Conversation

wlynch
Copy link

@wlynch wlynch commented Mar 24, 2023

What kind of change does this PR introduce?

Adds DangerousWorkflow check for imposter commits.

See https://www.chainguard.dev/unchained/what-the-fork-imposter-commits-in-github-actions-and-ci-cd

This borrows its implementation from
https://github.com/chainguard-dev/clank to look up imposter commits for
a repo.

(Is it a bug fix, feature, docs update, something else?)

What is the current behavior?

n/a (new check)

What is the new behavior (if this is a feature change)?**

  • Tests for the changes have been added (for bug fixes/features)

Which issue(s) this PR fixes

Fixes #2733

Special notes for your reviewer

GitHub e2e test is failing because the Actions yaml is testing against isn't actually valid (it references non-existent repos). We should fix this, but figure get this PR out for review.

Does this PR introduce a user-facing change?

For user-facing changes, please add a concise, human-readable release note to
the release-note

(In particular, describe what changes users might need to make in their
application as a result of this pull request.)

Adds new DangerousWorkflow check for detecting imposter commits.

@codecov
Copy link

codecov bot commented Mar 24, 2023

Codecov Report

Merging #2789 (b1b6b72) into main (b6362b1) will decrease coverage by 0.06%.
The diff coverage is 45.88%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2789      +/-   ##
==========================================
- Coverage   49.21%   49.15%   -0.06%     
==========================================
  Files         158      158              
  Lines       11967    12129     +162     
==========================================
+ Hits         5889     5962      +73     
- Misses       5709     5791      +82     
- Partials      369      376       +7     

Copy link
Member

@spencerschrock spencerschrock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a complete review, but wanted to get some comments out that address some big blockers with this current implementation. This change will likely require converting Dangerous-Workflow away from a file content only check:

func init() {
supportedRequestTypes := []checker.RequestType{
checker.FileBased,
checker.CommitBased,
}

I'll also post some thoughts in the main issue for this PR that are better suited there.

checks/raw/dangerous_workflow.go Show resolved Hide resolved
checks/raw/dangerous_workflow.go Outdated Show resolved Hide resolved
Comment on lines 129 to 135
newClient := &Client{
ctx: client.ctx,
repoClient: client.repoClient,
graphClient: client.graphClient,
contributors: client.contributors,
branches: client.branches,
releases: client.releases,
workflows: client.workflows,
checkruns: client.checkruns,
statuses: client.statuses,
search: client.search,
searchCommits: client.searchCommits,
webhook: client.webhook,
languages: client.languages,
licenses: client.licenses,
tarball: client.tarball,
}
repo, err := MakeGithubRepo(inputRepo)
if err != nil {
return nil, err
}
if err := newClient.InitRepo(repo, commitSHA, commitDepth); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have concerns about what this InitRepo call will do to the state of the parent client since all of the struct fields are pointers.

For example, InitRepo will call client.licenses.init(client.ctx, client.repourl) which will wipe any data seen already, reset some sync primitives, etc.

func (handler *licensesHandler) init(ctx context.Context, repourl *repoURL) {
	handler.ctx = ctx
	handler.repourl = repourl
	handler.errSetup = nil
	handler.once = new(sync.Once)
	handler.licenses = nil
}

There are all sorts of nasty race conditions and duplicated work waiting to happen here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored this to use the same handler creation logic as CreateGithubRepoClientWithTransport. PTAL!

clients/gitlabrepo/client.go Show resolved Hide resolved
@spencerschrock
Copy link
Member

I agree with the need for a sub-RepoClient, as we're trying to answer the question "does this commit belong to repo B" but I'm not sure how I feel about the current approach. A new method like NewClient makes sense since RepoClient might be a github/gitlab/local etc, but I'm wondering if there's a way we can make the child RepoClient lighter.

If we modify RepoClient so the struct fields are optional that could help:

e.g.

if client.graphClient != nil {
    client.graphClient.init(client.ctx, client.repourl, client.commitDepth)
}

In the GitHub implementation for example, the ContainsRevision method only uses the branchesHandler handler. It does it to read repourl which is already a member of Client, so we could even eliminate the need to use branchesHandler.

@spencerschrock
Copy link
Member

I think it makes sense to discuss this some more before re-implementing things.

@wlynch
Copy link
Author

wlynch commented Apr 3, 2023

For the repo client - good to know that the handlers are stateful. I think we could probably get a fresh client if we just refactor some of the client initialization from here -

return &Client{
ctx: ctx,
repoClient: client,
graphClient: &graphqlHandler{
client: graphClient,
},
contributors: &contributorsHandler{
ghClient: client,
},
branches: &branchesHandler{
ghClient: client,
graphClient: graphClient,
},
releases: &releasesHandler{
client: client,
},
workflows: &workflowsHandler{
client: client,
},
checkruns: &checkrunsHandler{
client: client,
graphClient: graphClient,
},
statuses: &statusesHandler{
client: client,
},
search: &searchHandler{
ghClient: client,
},
searchCommits: &searchCommitsHandler{
ghClient: client,
},
webhook: &webhookHandler{
ghClient: client,
},
languages: &languagesHandler{
ghclient: client,
},
licenses: &licensesHandler{
ghclient: client,
},
tarball: tarballHandler{
httpClient: httpClient,
},
}

The main piece we want to reuse is the authed client, which we can pick out from the client struct. I'll take a pass at it. (also open to other ideas)

wlynch added 4 commits April 3, 2023 17:23
See https://www.chainguard.dev/unchained/what-the-fork-imposter-commits-in-github-actions-and-ci-cd

This borrows its implementation from
https://github.com/chainguard-dev/clank to look up imposter commits for
a repo.

Signed-off-by: Billy Lynch <billy@chainguard.dev>
This limits the number of calls made instead of probing every branch.

Signed-off-by: Billy Lynch <billy@chainguard.dev>
Signed-off-by: Billy Lynch <billy@chainguard.dev>
Signed-off-by: Billy Lynch <billy@chainguard.dev>
@wlynch wlynch force-pushed the imposter-commits branch 2 times, most recently from 678ed46 to b807b44 Compare April 3, 2023 21:25
Signed-off-by: Billy Lynch <billy@chainguard.dev>
@wlynch wlynch force-pushed the imposter-commits branch from b807b44 to bd350bf Compare April 3, 2023 21:27
@wlynch
Copy link
Author

wlynch commented Apr 3, 2023

This change will likely require converting Dangerous-Workflow away from a file content only check:

Looks like CommitBased / FileBased are the only 2 types at the moment.

const (
// FileBased request types require checks to run solely on file-content.
FileBased RequestType = iota
// CommitBased request types require checks to run on non-HEAD commit content.
CommitBased
)

I assume we'll need to make a new one? Any suggestions for what this new type should be? APIBased?

@spencerschrock
Copy link
Member

spencerschrock commented Apr 3, 2023

This change will likely require converting Dangerous-Workflow away from a file content only check:

Looks like CommitBased / FileBased are the only 2 types at the moment.

const (
// FileBased request types require checks to run solely on file-content.
FileBased RequestType = iota
// CommitBased request types require checks to run on non-HEAD commit content.
CommitBased
)

I assume we'll need to make a new one? Any suggestions for what this new type should be? APIBased?

I meant more along the lines of removing FileBased from the Dangerous-Workflow check as it now requires the API call. I don't think we need an APIBased as every check is assumed to be API based unless FileBased is declared.

func init() {
supportedRequestTypes := []checker.RequestType{
checker.FileBased,
checker.CommitBased,
}

Which is a little unfortunate, as there's still plenty of functionality which works on the file content. Scorecard would benefit from some sort of "optional API call" as once a check removes support for FileBased it can't be run locally. But that's outside the scope of this PR

go run main.go --local=. --checks Dangerous-Workflow
Error: GetEnabled: internal error: Unsupported RequestType [0] by check: Dangerous-Workflow
2023/04/03 14:35:00 error during command execution: GetEnabled: internal error: Unsupported RequestType [0] by check: Dangerous-Workflow

@spencerschrock
Copy link
Member

For the repo client - good to know that the handlers are stateful. I think we could probably get a fresh client if we just refactor some of the client initialization from here -

return &Client{
ctx: ctx,
repoClient: client,
graphClient: &graphqlHandler{
client: graphClient,
},
contributors: &contributorsHandler{
ghClient: client,
},
branches: &branchesHandler{
ghClient: client,
graphClient: graphClient,
},
releases: &releasesHandler{
client: client,
},
workflows: &workflowsHandler{
client: client,
},
checkruns: &checkrunsHandler{
client: client,
graphClient: graphClient,
},
statuses: &statusesHandler{
client: client,
},
search: &searchHandler{
ghClient: client,
},
searchCommits: &searchCommitsHandler{
ghClient: client,
},
webhook: &webhookHandler{
ghClient: client,
},
languages: &languagesHandler{
ghclient: client,
},
licenses: &licensesHandler{
ghclient: client,
},
tarball: tarballHandler{
httpClient: httpClient,
},
}

The main piece we want to reuse is the authed client, which we can pick out from the client struct. I'll take a pass at it. (also open to other ideas)

Agree that extracting the auth'd transport is the main bit we care about. I'll take a look at any follow-up commits later.

We can always revisit the optional init() comment if performance becomes a problem. The tarball handler's init function for example downloads the tarball and extracts it to a /tmp directory. So all of these subclients would be doing extra work when we really just care about is making an API call using the auth from the main client.

@spencerschrock
Copy link
Member

spencerschrock commented Apr 3, 2023

Disregard about the tarball handler, it's loaded lazily so shouldn't be a problem

We need to remove this because we need to make API calls to verify
commit reachability for imposter commits.

We may want to look into ways to break this up so that the pieces that
don't need API access can still run locally.

Signed-off-by: Billy Lynch <billy@chainguard.dev>
@wlynch wlynch requested a review from spencerschrock April 3, 2023 21:46
@wlynch wlynch had a problem deploying to integration-test April 4, 2023 17:09 — with GitHub Actions Failure
Copy link
Member

@spencerschrock spencerschrock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The NewClient approach looks good to me.

I've left some comments on the ref parsing, with regard to re-usable workflows (which should be vulnerable to imposter commits) as well as filtering out non commit SHA refs.

And finally, the increased quota would likely be a problem for running this check in the cron so want to get some input from @azeemshaikh38 on that one.

Comment on lines -29 to -33
supportedRequestTypes := []checker.RequestType{
checker.FileBased,
checker.CommitBased,
}
if err := registerCheck(CheckDangerousWorkflow, DangerousWorkflow, supportedRequestTypes); err != nil {
Copy link
Member

@spencerschrock spencerschrock Apr 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The check is still checker.CommitBased in my opinion

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1


// If not, query subrepo for commit reachability.
// Make new client for referenced repo.
subclient, err := c.client.NewClient(repo, "", 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

commitSHA probably shouldn't be "" here, perhaps clients.HeadSHA?

@@ -52,6 +53,7 @@ type RepoClient interface {
ListStatuses(ref string) ([]Status, error)
ListWebhooks() ([]Webhook, error)
ListProgrammingLanguages() ([]Language, error)
ContainsRevision(base, target string) (bool, error)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if clients are initialized at a given SHA:
e.g. InitRepo and NewClient take in a commitSHA arg, do we need base as an arg to ContainsRevision?

Comment on lines +89 to +97
for _, repo := range []string{
"http://github.com/actions/checkout",
"http://github.com/ossf-tests/scorecard-check-dangerous-workflow-e2e",
} {
_, e := git.PlainClone(tmpDir, false, &git.CloneOptions{
URL: repo,
})
Expect(e).Should(BeNil())
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these break since the check is no-longer FileBased

pdata *checker.DangerousWorkflowData,
) error {
ctx := context.TODO()
cache := &containsCache{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from the perspective of the cron, it would be nice for the cron if the cache persisted between calls. I'm curious how much overlap there would be.

Not sure what the best approach would be

cc @azeemshaikh38

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for reference, running just the Dangerous-Workflow check on ossf/scorecard consumes 93 core REST quota.

39 for urllib3/urllib3
73 for tensorflow/tensorflow

It's going to be very repo-dependent, based on their CI.

return sce.WithMessage(sce.ErrorCheckRuntime, fmt.Sprintf("unexpected repo reference: %s", s[0]))
}
repo := strings.Join(repoSplit[:2], "/")
sha := s[1]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we do a check here for if the SHA is actually a sha? Because this could be a tag too? which wouldn't need the imposter commit verification.

The branch protection check uses something like this to check for it:

// as a package level variable
var reCommitSHA = regexp.MustCompile("^[a-f0-9]{40}$")

...
// when testing
if !reCommitSHA.MatchString(foo) {
	continue
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would also need to force sha to lowercase, or account for it in the regex

Comment on lines +294 to +295
for _, job := range workflow.Jobs {
for _, step := range job.Steps {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't cover re-usable workflows ( i assume they are also vulnerable). Would it make sense to loop over Uses.Values from both jobs (job.WorkflowCall.Uses.Value) and steps to get them into a slice? And then doing the same analysis over everything in the slice?

https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#jobsjob_iduses

I was trying to check how the code handled this job and it wasn't being checked:

provenance:
needs: [goreleaser]
permissions:
actions: read # To read the workflow path.
id-token: write # To sign the provenance.
contents: write # To add assets to a release.
uses: slsa-framework/slsa-github-generator/.github/workflows/generator_generic_slsa3.yml@v1.4.0

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reusable workflows can't be hash pinned. Would be interesting to check whether the imposter commit holds true for reusable workflows too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taken from the docs:

{ref} can be a SHA, a release tag, or a branch name.

And an experiment I did last year for a different reason
https://github.com/spencerschrock/reusable-workflow-caller/blob/aa0ccd0b1d5255d79a7ba32fd729a1db93d2f124/.github/workflows/scorecard.yml#L30

Copy link
Contributor

@azeemshaikh38 azeemshaikh38 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some high-level comments about code architecture:

  • We could avoid changes to the RepoClient interface altogether and instead inject a new dependency in RunScorecard fn.
  • For the cron job, we can later optimize this injected dependency with a more API efficient model (like a blobstore-based cache maybe) while regular CLI continues to use REST API.
  • Flag-guard this change so that it does not get rolled out to prod without some testing on our end.

Comment on lines -29 to -33
supportedRequestTypes := []checker.RequestType{
checker.FileBased,
checker.CommitBased,
}
if err := registerCheck(CheckDangerousWorkflow, DangerousWorkflow, supportedRequestTypes); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@@ -30,6 +30,7 @@ const HeadSHA = "HEAD"
// RepoClient interface is used by Scorecard checks to access a repo.
type RepoClient interface {
InitRepo(repo Repo, commitSHA string, commitDepth int) error
NewClient(repo string, commitSHA string, commitDepth int) (RepoClient, error)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand the usecase for NewClient API. Could we not use CreateGitHubRepoClient instead?

Comment on lines +294 to +295
for _, job := range workflow.Jobs {
for _, step := range job.Steps {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reusable workflows can't be hash pinned. Would be interesting to check whether the imposter commit holds true for reusable workflows too.

@github-actions
Copy link

Stale pull request message

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Feature: Dangerous Workflows - Imposter Commit Check
3 participants